1 Introduction to R for spatial modeling

1.1 Basic data management

1.1.1 Reading data

1.1.1.1 Setting up a working directory (working directory)

As with most of the projects, we start choosing a working directory to indicate R where are our files.

getwd()                       # Current working directory 
setwd("C:/GEO77_R_course")    # Set a working directory
setwd("C:\\GEO77_R_course")   # another valid way to select the same working directory
setwd("C:\GEO77_R_course")    # an invalid way of selecting the same working directory

1.1.1.2 And if I do not want to write code?

  • or go to Session/Set Working Directory/Choose directory… and select folder

  • or press Ctrl+⇧Shift+H and select folder

Why?

  • Clarity
  • Simplification of data management
  • Facilitates access to our data

1.1.2 Loading data

R has the ability to load multiple file formats, which can be further extended through packages.

Reading a dataset read.table()

soil_data <- read.table("./data/soil_data.txt", header = TRUE)
names(soil_data)
##  [1] "ID"            "X_coord"       "Y_coord"       "Id"           
##  [5] "pH"            "P"             "K"             "Mg"           
##  [9] "horiz"         "upper_depth"   "lower_depth"   "stones"       
## [13] "coarse_sand"   "medium_sand"   "fine_sand"     "coarse_silt"  
## [17] "medium_silt"   "fine_silt"     "clay"          "soil_type_ka4"
## [21] "pH_cacl2"
str(soil_data)
## 'data.frame':    57 obs. of  21 variables:
##  $ ID           : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ X_coord      : int  4428062 4427932 4428028 4428092 4428089 4428105 4428122 4428129 4428086 4427954 ...
##  $ Y_coord      : int  5749930 5749890 5749724 5749577 5749360 5749189 5749044 5748889 5748785 5748947 ...
##  $ Id           : int  110 111 112 113 114 115 116 117 118 119 ...
##  $ pH           : num  6.3 7 6.1 6.6 6 5.5 5.5 5.2 5.2 5.3 ...
##  $ P            : num  25.4 88 11.7 7.8 4.6 4.4 6.4 5.5 12.6 9.5 ...
##  $ K            : num  16 23 22 18 16 19 15 13 17 17 ...
##  $ Mg           : num  7.3 7.4 8.2 7.8 7.6 6.9 6 5.9 5.8 6.7 ...
##  $ horiz        : chr  "Ap" "Ap" "Ap" "Ap" ...
##  $ upper_depth  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lower_depth  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ stones       : num  1.1 3.4 1.5 0.04 0.2 0.4 1.4 1.3 26.2 24.7 ...
##  $ coarse_sand  : num  1.4 3.2 2.1 0.8 0.6 0.8 2 2.2 8.3 8.9 ...
##  $ medium_sand  : num  5.8 14.7 6.5 1.1 1.1 1.5 2.9 2.9 8.7 9.7 ...
##  $ fine_sand    : num  27.6 6.9 3.9 1.7 1.7 2.8 3.7 3.9 8.9 9.3 ...
##  $ coarse_silt  : num  39.5 42.1 47.1 53.3 52.3 52.8 51.5 52.4 43.9 42.9 ...
##  $ medium_silt  : num  11 10.6 13.7 13.7 16.1 15.2 14.8 14.5 10 12 ...
##  $ fine_silt    : num  2.9 5.9 6.2 6.5 7.6 6.6 6.6 7.2 6.2 5.6 ...
##  $ clay         : num  11.8 16.6 20.5 22.9 20.6 20.3 18.5 16.9 14 11.6 ...
##  $ soil_type_ka4: chr  "Uls" "Uls" "Ut4" "Ut4" ...
##  $ pH_cacl2     : num  7.33 7.26 5.87 6.79 5.77 6.3 6.46 5.79 5.21 7.25 ...
ID X_coord Y_coord Id pH P K Mg horiz upper_depth lower_depth stones coarse_sand medium_sand fine_sand coarse_silt medium_silt fine_silt clay soil_type_ka4 pH_cacl2
0 4428062 5749930 110 6.3 25.4 16 7.3 Ap 0 3 1.10 1.4 5.8 27.6 39.5 11.0 2.9 11.8 Uls 7.33
1 4427932 5749890 111 7.0 88.0 23 7.4 Ap 0 3 3.40 3.2 14.7 6.9 42.1 10.6 5.9 16.6 Uls 7.26
2 4428028 5749724 112 6.1 11.7 22 8.2 Ap 0 3 1.50 2.1 6.5 3.9 47.1 13.7 6.2 20.5 Ut4 5.87
3 4428092 5749577 113 6.6 7.8 18 7.8 Ap 0 3 0.04 0.8 1.1 1.7 53.3 13.7 6.5 22.9 Ut4 6.79
4 4428089 5749360 114 6.0 4.6 16 7.6 Ap 0 3 0.20 0.6 1.1 1.7 52.3 16.1 7.6 20.6 Ut4 5.77
5 4428105 5749189 115 5.5 4.4 19 6.9 Ap 0 3 0.40 0.8 1.5 2.8 52.8 15.2 6.6 20.3 Ut4 6.30
6 4428122 5749044 116 5.5 6.4 15 6.0 Ap 0 3 1.40 2.0 2.9 3.7 51.5 14.8 6.6 18.5 Ut4 6.46
7 4428129 5748889 117 5.2 5.5 13 5.9 Ap 0 3 1.30 2.2 2.9 3.9 52.4 14.5 7.2 16.9 Ut3 5.79
8 4428086 5748785 118 5.2 12.6 17 5.8 Ap 0 3 26.20 8.3 8.7 8.9 43.9 10.0 6.2 14.0 Uls 5.21
9 4427954 5748947 119 5.3 9.5 17 6.7 Ap 0 3 24.70 8.9 9.7 9.3 42.9 12.0 5.6 11.6 Uls 7.25
10 4427887 5749114 120 6.5 18.9 22 7.0 Ap 0 3 10.50 8.4 8.0 8.7 40.3 11.5 6.1 17.0 Lu 6.27
11 4427836 5749319 121 6.1 7.2 28 9.1 Ap 0 3 1.20 2.6 2.5 6.4 48.0 15.2 4.5 20.8 Ut4 6.81
12 4427950 5749243 122 5.5 7.1 21 7.0 Ap 0 3 33.90 7.3 8.8 7.3 44.3 12.5 4.3 15.5 Uls 5.86
13 4427929 5749496 123 6.0 5.1 19 7.6 Ap 0 3 0.20 0.7 1.5 2.6 52.0 15.7 5.8 21.7 Ut4 6.04
14 4427807 5749625 124 6.4 7.9 22 9.5 Ap 0 3 0.60 1.1 2.2 4.3 50.6 13.7 6.0 22.1 Ut4 6.29
15 4427786 5749495 125 6.3 9.5 34 7.6 Ap 0 3 1.00 2.2 2.6 4.5 51.5 13.4 6.0 19.8 Ut4 5.88
16 4427579 5749612 126 6.9 21.1 25 5.1 Ap 0 3 8.70 2.4 3.0 7.3 44.6 12.5 7.6 22.6 Lu 7.05
17 4427668 5749765 127 6.7 17.0 24 6.6 Ap 0 3 12.70 5.7 7.3 7.7 42.9 10.3 4.0 22.1 Lu 7.10
18 4427780 5749847 128 6.2 31.3 31 9.7 Ap 0 3 0.50 1.3 3.2 5.7 47.8 11.7 6.4 23.9 Ut4 5.49
19 4427563 5749885 129 7.1 23.6 29 5.9 Ap 0 3 3.30 2.9 3.8 5.0 47.6 11.6 5.1 24.0 Lu 7.36
20 4427461 5749819 130 7.0 24.3 32 7.3 Ap 0 3 0.90 3.2 3.1 13.3 42.5 16.2 5.0 16.7 Uls 7.02
21 4427347 5749787 131 6.8 20.1 27 6.8 Ap 0 3 1.70 10.2 5.5 31.9 27.8 7.5 4.2 12.9 Sl4 6.91
22 4427303 5749690 132 7.2 25.0 21 3.0 Ap 0 3 2.60 28.8 7.5 23.1 19.1 5.8 3.8 11.9 Sl3 7.29
23 4427560 5749481 133 7.0 20.5 18 4.1 Ap 0 3 6.60 21.2 6.9 40.2 14.8 2.9 3.1 10.9 Sl3 6.78
24 4427704 5749249 134 6.7 11.1 20 7.0 Ap 0 3 32.80 22.1 12.2 16.8 23.3 8.7 6.9 10.0 Sl3 6.35
25 4427785 5749011 135 7.1 7.6 17 8.1 Ap 0 3 3.20 0.3 0.7 4.6 57.4 14.8 6.6 15.6 Ut3 7.49
26 4427618 5749083 136 6.2 5.9 22 8.7 Ap 0 3 0.60 0.8 1.7 6.8 51.5 15.9 7.2 16.1 Ut3 6.22
27 4427660 5748900 137 6.5 6.1 19 10.4 Ap 0 3 0.30 0.5 0.8 5.2 52.5 14.6 7.5 18.9 Ut4 6.80
28 4427849 5748858 138 7.2 8.5 16 7.5 Ap 0 3 0.20 0.4 1.0 6.7 46.1 17.5 8.6 19.7 Ut4 7.40
29 4427440 5748925 139 5.7 3.4 16 8.8 Ap 0 3 0.20 0.3 0.8 5.1 48.1 17.6 6.1 22.0 Ut4 5.16
30 4427476 5749111 140 6.1 3.6 14 8.2 Ap 0 3 0.20 0.5 0.8 4.7 53.1 13.0 6.6 21.3 Ut4 6.56
31 4427512 5749330 141 6.6 7.0 20 7.9 Ap 0 3 0.40 0.4 0.7 4.3 58.7 15.1 4.2 16.6 Ut3 7.46
32 4427357 5749507 142 7.0 11.4 18 5.6 Ap 0 3 0.30 1.4 1.9 11.0 47.1 15.0 4.9 18.7 Ut4 7.19
33 4427194 5749687 143 7.3 16.5 26 6.9 Ap 0 3 0.20 1.1 1.4 7.2 51.4 15.1 6.6 17.2 Ut4 7.41
34 4427095 5749673 144 7.4 13.7 18 8.8 Ap 0 3 0.20 0.8 1.1 5.2 50.7 14.5 6.8 20.9 Ut4 7.42
35 4427179 5749506 145 7.3 12.2 19 6.4 Ap 0 3 0.20 0.3 0.9 7.6 50.3 16.5 7.7 16.7 Ut3 7.46
36 4427373 5749355 146 7.1 11.0 24 7.5 Ap 0 3 11.10 2.8 2.8 16.8 30.4 9.3 8.5 29.4 Lt2 7.31
37 4427351 5749188 147 6.3 4.1 19 8.3 Ap 0 3 0.70 1.3 1.7 7.3 48.4 14.3 7.6 19.4 Ut4 6.51
38 4427197 5749195 148 5.8 6.7 25 7.9 Ap 0 3 5.00 2.7 2.9 13.6 43.7 11.7 7.0 18.4 Lu 4.93
39 4427135 5749339 149 7.2 7.7 29 6.2 Ap 0 3 15.90 7.8 5.7 14.6 26.0 12.2 9.0 24.7 Ls2 7.43
40 4427010 5749504 150 7.3 11.3 17 8.8 Ap 0 3 0.20 0.4 0.7 8.1 46.4 18.6 9.6 16.2 Ut3 7.33
41 4426918 5749618 151 7.2 13.7 21 10.2 Ap 0 3 0.10 0.7 1.3 5.7 47.6 18.7 6.2 19.8 Ut4 7.32
42 4426723 5749512 152 7.3 15.8 18 6.8 Ap 0 3 1.60 1.0 0.9 10.7 47.4 15.2 6.2 18.6 Ut4 7.44
43 4426881 5749397 153 7.3 13.5 18 11.3 Ap 0 3 0.10 0.3 1.0 7.3 60.1 10.4 5.6 15.3 Ut3 7.44
44 4426941 5749277 154 7.2 15.0 28 7.7 Ap 0 3 1.10 1.7 2.0 22.0 40.2 12.4 6.5 15.2 Uls 7.16
45 4427011 5749134 155 6.8 11.0 23 8.5 Ap 0 3 1.20 1.3 1.5 15.5 40.9 15.1 6.6 19.1 Lu 6.44
46 4427115 5749038 156 6.4 21.8 29 8.4 Ap 0 3 9.20 2.5 2.8 15.6 42.6 15.1 6.6 14.8 Uls 6.37
47 4427237 5748957 157 5.6 6.4 20 8.2 Ap 0 3 0.50 0.5 2.5 8.5 44.5 17.8 7.5 18.7 Ut4 5.01
48 4426845 5749062 158 7.1 11.4 18 8.7 Ap 0 3 13.80 1.1 1.2 40.8 28.7 7.8 4.7 15.7 Slu 7.39
49 4426762 5749297 159 7.2 14.6 22 7.1 Ap 0 3 0.10 0.4 0.9 7.4 49.5 15.3 7.5 19.0 Ut4 7.41
50 4426567 5749453 160 7.2 19.1 20 8.1 Ap 0 3 1.20 0.7 0.9 10.3 54.6 11.9 8.1 13.5 Ut3 7.09
51 4426632 5749287 161 7.1 12.1 17 10.8 Ap 0 3 0.10 0.4 1.0 8.8 48.4 14.9 8.1 18.4 Ut4 7.18
52 4426676 5749108 162 7.3 14.2 18 7.5 Ap 0 3 5.00 1.3 2.1 18.2 43.7 12.3 4.6 17.8 Lu 7.41
53 4426481 5749154 163 7.3 18.2 17 8.6 Ap 0 3 1.60 1.4 1.6 19.6 36.1 12.7 4.8 23.8 Lu 7.43
54 4426426 5749333 164 7.3 25.5 20 8.0 Ap 0 3 1.00 1.4 1.2 9.7 52.4 12.3 5.5 17.5 Ut4 7.44
55 4426255 5749196 165 6.9 14.5 17 8.5 Ap 0 3 0.60 1.1 1.4 14.5 47.5 14.6 5.0 15.9 Ut3 6.06
56 4426093 5749162 166 7.2 21.2 19 6.2 Ap 0 3 1.30 1.3 1.8 15.1 50.0 12.5 4.4 14.9 Ut3 7.31

1.1.2.1 don’t forget!

1.1.3 Writing data write.table()

R also allows to export our output/results in different formats:

write.table(soil_data, file = "output/this_is_a_copy_of_soil_data_made_with_R.csv", sep = ",",
            row.names = FALSE, col.names = TRUE)

1.2 Statistical evaluation

When we work with data, is important to describe our data to understand it better. There are a wide variety of statistics used to describe or summarize data in terms of central tendency, shape, position, etc.

pH <- soil_data$pH
K <- soil_data$K

1.2.1 Descriptive statistics

mean(pH)
## [1] 6.624561
median(pH)
## [1] 6.8
min(pH)
## [1] 5.2
max(pH)
## [1] 7.4
range(pH)
## [1] 5.2 7.4
sd(pH)
## [1] 0.6467724
quantile(pH)
##   0%  25%  50%  75% 100% 
##  5.2  6.2  6.8  7.2  7.4
summary(pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.200   6.200   6.800   6.625   7.200   7.400

some functions are not included in base R, but implemented via packages. E.g. moments

library(moments)
skewness(pH)
## [1] -0.7153593

Skewness indicates if our data distribution is symmetric or not:

kurtosis(pH)
## [1] 2.315821

Kurtosis describe is how long is the tail of our data distribution:

1.2.2 Correlation

Pearson correlation coefficient cor()

\[{r_{xy}=\frac{\sum_{i = 1}^{n}(x_i-\overline{x})(y_{i}-\overline{y})}{\sqrt{\sum_{i = 1}^{n}(x_{i}-\overline{x})^2}{\sqrt{\sum_{i = 1}^{n}(y_{i}-\overline{y})^2}}}}\]

where:

  • \(n\) is sample size
  • \(X_{i}\), \(Y_{i}\) are the individual sample points indexed with \(i\)
  • \(\overline{x} = \frac{1}{n}\sum_{x = 1}^{n} x_{i}\) (the sample mean);

and analogously for \({\bar {y}}\)

Examples:

So now with our data…

cor(pH, K)
## [1] 0.1480799

Spurious correlations

1.2.3 Linear regression lm()

in R, the linear models are written as lm(y ~ x, data)

Example:

# loading some example data
pH <- c(7.25, 7.91, 5.43, 8.96, 3.21, 4.32, 6.45, 7.22, 9.1, 6.77, 8.43, 5.51, 7.5, 6.64, 4.55, 2.18, 9.79, 7.65)
Corg <- c(4.23, 3.22, 3.1, 7.39, 3.11, 2.37, 5.51, 5.01, 7.26, 6.4, 6.38, 3.37, 8.29, 6.93, 2.87, 0.12, 7.57, 9.4)

data <- as.data.frame(cbind(pH, Corg))
linearModel <- lm(pH ~ Corg , data = data)
summary(linearModel)
## 
## Call:
## lm(formula = pH ~ Corg, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0397 -0.9752 -0.1567  0.9637  2.5869 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.1757     0.7173   4.427 0.000423 ***
## Corg          0.6669     0.1264   5.276 7.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.289 on 16 degrees of freedom
## Multiple R-squared:  0.635,  Adjusted R-squared:  0.6122 
## F-statistic: 27.84 on 1 and 16 DF,  p-value: 7.535e-05
plot(linearModel)

When running this command, we obtain a set of 4 plots that are out of the scope of this “introduction”, but for the curious ones:

  • Residuals vs Fitted : The residuals are distributed following a systematic pattern around the value 0, indicating that the linear regression is not the best. The residuals are also more concentrated in the center, while towards the extremes they show less dispersion, which could indicate heterogeneity among the error variances (heteroscedasticity). Some residuals stand out, indicating the possible presence of outliers.
  • Normal Q-Q: It compares a theoretical normal distribution with the residuals of our model. It should show a straight line for normality assumption and should not show systematic pasterns (should be randomly distributed around that straight line).
  • Scale-location: it shows if residuals are spread equally along the ranges of predictors. This is how you can check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points.
  • Residuals vs leverage: Unlike the other plots, this time the patterns are not relevant. We look for outlying values in the upper or lower right corner. These places are where cases with a large weight in the linear regression can be located. Look for cases outside the dotted lines, which means they have high Cook’s distance values.

This is just a visual check, not an air-tight proof, so it is somewhat subjective. But it allows us to see at-a-glance if our assumption is plausible, and if not, how the assumption is violated and what data points contribute to the violation.

Source: Understanding Diagnostic Plots for Linear Regression Analysis

1.2.4 Coefficient of determination with Pearson correlation coefficient cor()

\[R^2≡1-\frac{\sum(y_{i}-\hat{y_{i}})^2}{\sum(y_{i}-\overline{y})^2}\]

\[R^2=r×r\]

cor(data$pH, linearModel$fitted.values) * 
cor(data$pH, linearModel$fitted.values)
## [1] 0.6349996

1.2.5 root mean squared error (RMSE)

\[RMSE=\sqrt{\frac{1}{n}\sum_{i = 1}^{n}{(\hat{y_{i}}-y_{i})^2}}\]

sqrt(1 / length(rbind(linearModel$residuals * sum(rbind(linearModel$residuals) ^ 2))))
## [1] 0.2357023

1.2.6 Function for the root mean squared error (RMSE)

Base R does not include a function for RMSE in base R, but in R we can create custom functions as follows:

function_name <- function(input_x, input_y, ...) {
  some_operations
}
rmse <- function(x, y) {
  sqrt(mean((x - y) ^ 2))
}

Requires input

rmse(linearModel$fitted.values, Corg)
## [1] 1.668019

1.3 Visualization with plots

The R Graph Gallery

1.3.1 Boxplots boxplot()

boxplot(pH, Corg, names = c("pH value", "Corg content"), ylab = "pH value and Corg content")

  • Data to be displayed: pH and Corg
  • Labeling of the x-axis: names = c(“pH value”, “Corg content”)
  • Labeling of the y-axis: ylab = “pH value and Corg content”.

What does the boxplot show?

1.3.2 Histograms hist()

hist(pH,
     xlab = "pH value", 
     ylab = "Frequency", 
     main = ""
     )

1.3.3 Density plots plot(density()

plot(density(pH), 
     xlab = "pH value",
     ylab = "Frequency", 
     main = ""
     )

1.3.4 Scatterplots plot(x, y)

plot(pH, 
     Corg, 
     xlab = "pH value",
     ylab = "Corg content"
     )

1.3.5 Design options

Directly in the graphic routines (help with ?par)

  • Set colors with col = ...
  • Set symbol properties with pch = ..., sizes with cex = ...
  • Set title with main = ..., axis label with xlab = ..., ylab = ...
  • Set drawing area with xlim = ..., ylim = ...

After drawing a graphic

  • Complete lines and points with lines(...) or points(...) respectively.
  • Add captions (texts) with text(...) or mtext(...)
  • Complete titles with title(...)
  • Complete legend with legend(...)

Output adjustment

  • main Title of the diagram
  • xlab labeling of the x-axis
  • ylab labeling of the y-axis
  • breaks frequency classes number of bars
  • col color filling of the bars
  • cex gradual scaling of text size
hist(pH, 
     xlab = "pH value", 
     ylab = "Frequency", 
     main = "",
     breaks = seq(0, 10, 2), 
     col = "red",
     cex = 0.5
     )

1.4 All roads lead to Rome

When working with code, it is very common that there are several ways to solve the same problem.

Exercise: Frequency distribution grain size types KA4

  • Calculate all statistical parameters Create boxplots and histograms for clay content and pH (CaCl2)
  • Calculate correlations between for clay content, pH (CaCl2) and all other parameters
  • Evaluation of the results

And now for the frequency distribution grain size types KA4!

Hint: The three different version are made with a barplot(), plot() and hist()

1.4.1 Variant A

1.4.2 Variant B

1.4.3 Variant C

1.4.3.1 Solution

If you can’t solve it by yourself, you can look at the code and try to understand it, but I encourage you to try.

soil_type <- as.data.frame(table(soil_data$soil_type_ka4))
# Variant A
barplot(soil_type$Freq, space = 0, names = soil_type$Var1, 
                    xlab = "Soil type", ylab = "Frequency", 
                    main = "Frequency diagram KA4 Soil types")

# Variant B
plot(factor(soil_data$soil_type_ka4), 
                  xlab = "Soil type", 
                  ylab = "Frequency", 
                  main = "Frequency diagram KA4 Soil types")

# Variant C
hist(as.numeric(factor(soil_data$soil_type_ka4)),
                  breaks = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
                  freq=TRUE,
                  xaxt = "n",
                  xlab = "Soil type",
                  ylab = "Frequency",
                  main = "Frequency diagram KA4 Soil types")
axis(side = 1, at = seq(0.5, 8.5, 1), labels = soil_type$Var1)

1.5 Multiple plots in one figure

When you set graphical parameters in Rstudio using this method, it is important to set the properties, plot the figures and reset the properties to the way they were before. Otherwise, every time we plot something, it will follow the set parameters.

par(                              # set or query graphical parameters
  mfrow = c(1, 3),                # 1 x 3 pictures on one plot, equivalent to mfcol = c(3, 1)
  mar = c(5.1, 4.1, 4.1, 2.1),    # margins as c(bottom, left, top, right)
  oma = c(0, 0, 0, 0),            # outer margins in lines of text as c(bottom, left, top, right)
  mgp = c(3, .1, 0),              # margins line for axis title, axis label and axis line
  las = 0,                        # label axis style 
  cex.lab = 1,                    # size of the labels
  cex.axis = 1,                   # size of the axis annotation 
  xpd = FALSE                     # If FALSE, all plotting is clipped to the plot region, if TRUE, all plotting is clipped to the figure region, and if NA, all plotting is clipped to the device region.
    )

plot(…); plot(…); plot(…)

par(mfrow = c(1, 1), 
    mar = c(5.1, 4.1, 4.1, 2.1), 
    oma = c(0, 0, 0, 0), 
    mgp = c(3, 1, 0), 
    las = 0, 
    cex.lab = 1, 
    cex.axis = 1,
    )

1.6 Plots file output

In R it is possible to export images in the most common formats and can be extended from packages. The output format must be specified and the file will be generated by default in our working directory.

tiff("filename.tiff", width = 21, height = 8, units = "cm", res = 300)

par(…)
    plot(…); plot(…); plot(…)
par(…)

dev.off()

2 Do I have to learn all these commands by heart?

No! R and Rstudio provide several tools to help us do this:

  • comments: when running our script, R will ignore all the text that start with #. Is a good practice to comment our code, is very useful if somebody else need to understand your code or for yourself in the future. In Rstudio, you can comment/uncomment the selected text with Ctrl+⇧Shift+C
  • autocomplete: Rstudio will suggest possible parameter for a function. Inside a function press the key Tab ↹, and it will display a list with the possible parameters for that function. move up and down and select with ↵ Return.
  • help: you can select a function and press F1, write ?function, help(function) or help('function') in the Rstudio console, and it will display the corresponding documentation for that function in the help panel.
  • cheatsheets: It is very common that for the most popular packages there are cheat sheets summarizing the most important functions

3 Libraries

Many functions are not included in the basic version of R. Therefore, there is an almost confusing variety of additional libraries for special applications. Examples:

  • ggplot2 and lattice for advanced graphics
  • dplyr, reshape2 and tidyr for data manipulation
  • sp and sf for spatial data (shapefiles, geopackage, etc.)
  • raster, terra, stars for spatial raster data
  • caret as wrapping function for machine-learning libraries
  • RQGIS and RSAGA as bridges to QGIS and SAGA GIS
  • Currently, 18875 available packages in the CRAN package repository
  • some others available on Github

3.1 How to load a library?

In the console or the script

install.packages(ggplot2)       # install a library
library(ggplot2)                # loading a library
require(ggplot2)
detach(ggplot2)                 # Removing a loaded library

rp <- c("DescTools", "rstudioapi")              # required packages
ip <- rp[!(rp %in% installed.packages()[, "Package"])]  # subset packages

if(length(ip)) install.packages(ip, dependencies = TRUE)    # install packages

lapply(rp, require, character.only = TRUE)          # load packages
rm(np, rp)                              # remove vectos

4 Spatial modeling

Let’s start to explore the data with the tools we already know. In this case, we will use the variable P as example.

plot(soil_data$X_coord, 
     soil_data$Y_coord, 
     cex=soil_data$P*0.05)

Model: Limited representation of reality

Spatial modeling: Image of states or processes of locatable objects and their characteristic values

Data for spatial modeling:

  • Position in space (e.g. xy-coordinate)
  • Attributes/properties (e.g. sand content, carbon content, N, P, K, etc.)
library(sp)

soil_data_copy <- soil_data
coordinates(soil_data_copy) = ~X_coord + Y_coord
bubble(soil_data_copy,"pH")

4.1 Interpolation

library(gstat)
# Determination of the extension (xmin, xmax, ymin, ymax)
xmin <- min(soil_data$X_coord) - 100
xmax <- max(soil_data$X_coord) + 100
ymin <- min(soil_data$Y_coord) - 100
ymax <- max(soil_data$Y_coord) + 100
# Determination of the resolution (cellsize)
cellsize = 10
# Spanning a grid (raster format) for interpolation
grd<- expand.grid(x=seq(from= xmin, to= xmax, by= cellsize), y=seq(from= ymin, to= ymax, by= cellsize))
coordinates(grd) <- ~ x+y
gridded(grd) <- TRUE

4.1.1 IDW

Gstat library has for this the function

dataset.idw <- idw(soil_data$pH ~ 1, location = ~X_coord+Y_coord, soil_data, grd)
## [inverse distance weighted interpolation]
image(dataset.idw["var1.pred"],col=topo.colors(20))

points(soil_data$X_coord, soil_data$Y_coord, col ="blue")

4.1.2 Kriging

#Define spatial objects
coordinates(soil_data) = c("X_coord", "Y_coord")
# Variogram
v <- variogram(pH ~ 1, locations = coordinates, data = soil_data, width = cellsize * 0.5)
# Variogram fit
v.fit <- fit.variogram(v, fit.method = TRUE, model = vgm(0.1, "Gau", 600))
# Output of the variogram
plot(v, model = v.fit)

4.1.2.1 Applying the model fit with the krige() function.

soil_data <- read.table("data/soil_data.txt", header = TRUE)

# ordinary kriging:
z <- krige(formula = pH ~ 1, locations = ~ X_coord + Y_coord, data = soil_data, newdata = grd, model = v.fit, nmax = 500)
## [using ordinary kriging]
image(z, col = topo.colors(20))
points(soil_data$X_coord, soil_data$Y_coord)
title("Ordinary kriging Interpolation of pH-Value") 

# simple kriging:
z <- krige(pH ~ 1, locations = ~ X_coord + Y_coord, data = soil_data, newdata = grd, model = v.fit, nmax = 500, beta = mean(soil_data$pH))
## [using simple kriging]
image(z, col = topo.colors(20))
points(soil_data$X_coord, soil_data$Y_coord)
title("Simple kriging Interpolation of pH-Value") 

Exercise:

  • Tests Simple Kriging in addition to Ordinary Kriging,
  • Differences theoretically and in visual evaluation?

4.1.2.2 Model validation

soil_data <- read.table("data/soil_data.txt", header = TRUE)

x <- krige.cv(pH ~ 1, ~ X_coord+Y_coord, data = soil_data, maxdist = 500, nfold = 10)

cor(x$observed, x$var1.pred) * cor(x$observed, x$var1.pred) # Rsquared
## [1] 0.6606139
sqr_error <- (x$var1.pred - x$observed )^2
sqrt(sum(sqr_error)/length(sqr_error))       # RMSE
## [1] 0.3770356

4.1.3 ML: decision trees

This is a simple example about how to built a decision tree. For this example, we will use the package rpart

library(rpart)

soil_model <- rpart(pH ~ P + coarse_sand, data = soil_data, method = "anova", control = rpart.control(cp = 0))

4.1.3.1 Visualization of a decision tree

The structure of a decision/classification trees can be depicted visually, which helps to understand how the tree makes its decisions.

# Load the rpart.plot package
library(rpart.plot)

# Plot the soil_model with default settings
rpart.plot(soil_model)

4.1.3.2 Why do some branches split?

A classification tree grows using a divide-and-conquer process. Each time the tree grows larger, it splits groups of data into smaller subgroups, creating new branches in the tree. This process always looks to create the split resulting in the greatest improvement to purity (subgroup homogeneity).

Classification trees tend to grow easily due to they divide the data using one variable value, producing the most pure partition each time:

4.1.3.3 Creating random test datasets

Before building a more sophisticated pH model, it is important to hold out a portion of the data to simulate how well it will predict the pH of unknown data points.

As depicted in the following image, you can use 75% of the observations for training and 25% for testing the model.

The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training. Then we can use the resulting vector of row IDs to subset the samples into training and testing datasets as:

# Determine the number of rows for training
nrow(soil_data)
## [1] 57
nrow(soil_data) * 0.75
## [1] 42.75
# Create a random sample of row IDs
sample_rows <- sample(nrow(soil_data), nrow(soil_data) * 0.75)

# Create the training dataset
soil_train <- soil_data[sample_rows, ]

# Create the test dataset
soil_test <- soil_data[-sample_rows, ]
str(soil_train)
## 'data.frame':    42 obs. of  21 variables:
##  $ ID           : int  44 22 47 36 25 18 39 56 35 52 ...
##  $ X_coord      : int  4426941 4427303 4427237 4427373 4427785 4427780 4427135 4426093 4427179 4426676 ...
##  $ Y_coord      : int  5749277 5749690 5748957 5749355 5749011 5749847 5749339 5749162 5749506 5749108 ...
##  $ Id           : int  154 132 157 146 135 128 149 166 145 162 ...
##  $ pH           : num  7.2 7.2 5.6 7.1 7.1 6.2 7.2 7.2 7.3 7.3 ...
##  $ P            : num  15 25 6.4 11 7.6 31.3 7.7 21.2 12.2 14.2 ...
##  $ K            : num  28 21 20 24 17 31 29 19 19 18 ...
##  $ Mg           : num  7.7 3 8.2 7.5 8.1 9.7 6.2 6.2 6.4 7.5 ...
##  $ horiz        : chr  "Ap" "Ap" "Ap" "Ap" ...
##  $ upper_depth  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lower_depth  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ stones       : num  1.1 2.6 0.5 11.1 3.2 0.5 15.9 1.3 0.2 5 ...
##  $ coarse_sand  : num  1.7 28.8 0.5 2.8 0.3 1.3 7.8 1.3 0.3 1.3 ...
##  $ medium_sand  : num  2 7.5 2.5 2.8 0.7 3.2 5.7 1.8 0.9 2.1 ...
##  $ fine_sand    : num  22 23.1 8.5 16.8 4.6 5.7 14.6 15.1 7.6 18.2 ...
##  $ coarse_silt  : num  40.2 19.1 44.5 30.4 57.4 47.8 26 50 50.3 43.7 ...
##  $ medium_silt  : num  12.4 5.8 17.8 9.3 14.8 11.7 12.2 12.5 16.5 12.3 ...
##  $ fine_silt    : num  6.5 3.8 7.5 8.5 6.6 6.4 9 4.4 7.7 4.6 ...
##  $ clay         : num  15.2 11.9 18.7 29.4 15.6 23.9 24.7 14.9 16.7 17.8 ...
##  $ soil_type_ka4: chr  "Uls" "Sl3" "Ut4" "Lt2" ...
##  $ pH_cacl2     : num  7.16 7.29 5.01 7.31 7.49 5.49 7.43 7.31 7.46 7.41 ...
str(soil_test)
## 'data.frame':    15 obs. of  21 variables:
##  $ ID           : int  0 3 4 7 9 12 15 16 17 26 ...
##  $ X_coord      : int  4428062 4428092 4428089 4428129 4427954 4427950 4427786 4427579 4427668 4427618 ...
##  $ Y_coord      : int  5749930 5749577 5749360 5748889 5748947 5749243 5749495 5749612 5749765 5749083 ...
##  $ Id           : int  110 113 114 117 119 122 125 126 127 136 ...
##  $ pH           : num  6.3 6.6 6 5.2 5.3 5.5 6.3 6.9 6.7 6.2 ...
##  $ P            : num  25.4 7.8 4.6 5.5 9.5 7.1 9.5 21.1 17 5.9 ...
##  $ K            : num  16 18 16 13 17 21 34 25 24 22 ...
##  $ Mg           : num  7.3 7.8 7.6 5.9 6.7 7 7.6 5.1 6.6 8.7 ...
##  $ horiz        : chr  "Ap" "Ap" "Ap" "Ap" ...
##  $ upper_depth  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lower_depth  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ stones       : num  1.1 0.04 0.2 1.3 24.7 33.9 1 8.7 12.7 0.6 ...
##  $ coarse_sand  : num  1.4 0.8 0.6 2.2 8.9 7.3 2.2 2.4 5.7 0.8 ...
##  $ medium_sand  : num  5.8 1.1 1.1 2.9 9.7 8.8 2.6 3 7.3 1.7 ...
##  $ fine_sand    : num  27.6 1.7 1.7 3.9 9.3 7.3 4.5 7.3 7.7 6.8 ...
##  $ coarse_silt  : num  39.5 53.3 52.3 52.4 42.9 44.3 51.5 44.6 42.9 51.5 ...
##  $ medium_silt  : num  11 13.7 16.1 14.5 12 12.5 13.4 12.5 10.3 15.9 ...
##  $ fine_silt    : num  2.9 6.5 7.6 7.2 5.6 4.3 6 7.6 4 7.2 ...
##  $ clay         : num  11.8 22.9 20.6 16.9 11.6 15.5 19.8 22.6 22.1 16.1 ...
##  $ soil_type_ka4: chr  "Uls" "Ut4" "Ut4" "Ut3" ...
##  $ pH_cacl2     : num  7.33 6.79 5.77 5.79 7.25 5.86 5.88 7.05 7.1 6.22 ...

4.1.4 ML: random forest